Scripting for large-scale sequencing based on Hadoop
Abstract
Motivation and Objectives: The large volumes of data generated by modern sequencing experiments are difficult to manipulate and analyze. Traditional approaches, such as scripting and relational database queries, often prove inadequate, frustratingly slow, or hard to scale. Other data-intensive fields already faced these problems during the “big data revolution”, which produced novel computational paradigms such as MapReduce and scalable tools such as Hadoop and Pig. We describe our ongoing work on SeqPig, a tool that facilitates the use of the Pig Latin scripting language to manipulate, analyze and query sequencing data. SeqPig provides access to popular data formats and implements a number of high-level functions. Most importantly, it gives users access to Hadoop, a platform with proven scalability, from a high-level scripting language, whether the cluster is run locally or in the cloud.
Similar resources
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop
SUMMARY Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query se...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale, data-intensive applications. The rack-aware data placement strategy of the current Hadoop distributed file system assumes a homogeneous cluster in which each node has the same computing capacity and is assigned the same workload. Default Hadoop d...
Exploring Non-Homogeneity and Dynamicity of High Scale Cloud through Hive and Pig
The trace consists of cell information covering about 29 days and spanning 700k jobs. This paper presents a statistical analysis of this cluster trace. Because the trace is very large, Hive, a platform based on the Hadoop distributed file system (HDFS) for querying and analyzing big data, was used. Hive was accessed through its Beeswax interface. The data was imported into HDFS throu...
BioPig: a Hadoop-based analytic toolkit for large-scale sequence data
MOTIVATION The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most of the current bioinformatics tools become obsolete as they fail to scale with data. To tackle this 'data deluge', here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation. RESULTS We built BioPig on the Apach...
Hadoop-BAM: directly manipulating next generation sequencing data in the cloud
Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can ...
Publication date: 2013